feat(benchmark): Create mock LLM server for use in benchmarks #1403
base: develop
Conversation
Codecov Report
✅ All modified and coverable lines are covered by tests.

Impacted files:

@@           Coverage Diff            @@
##           develop    #1403      +/-   ##
===========================================
+ Coverage    71.66%   71.88%   +0.22%
===========================================
  Files          171      174       +3
  Lines        17020    17154     +134
===========================================
+ Hits         12198    12332     +134
  Misses        4822     4822
@@ -0,0 +1,14 @@
# SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Should we be updating the copyright date on new files?
Glad you pointed this out. We should update our LICENSE.md. I'll open a PR.
def get_latency_seconds(config: ModelSettings, seed: Optional[int] = None) -> float:
    """Sample latency for this request using the model's config
    Very inefficient to generate each sample singly rather than in batch
Is this a comment about the sampling method here? All you're doing is generating random numbers. Or are you saying that because batch=1 and batch=n are very different in real inference, we are not sampling realistically?
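For context, a vectorized version could draw all of the samples in a single call. This is only an illustrative sketch, not code from this PR; the function name and signature are assumptions:

```python
import numpy as np
from typing import Optional


def sample_latencies_batch(
    mean: float, std: float, min_s: float, max_s: float, n: int, seed: Optional[int] = None
) -> np.ndarray:
    """Draw n latency samples at once from a normal distribution and clip them to [min_s, max_s]."""
    rng = np.random.default_rng(seed)
    return np.clip(rng.normal(loc=mean, scale=std, size=n), min_s, max_s)
```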
#
# class TestGetResponse:
#     """Test the get_response function."""
Remove commented code?
return response
@app.post("/v1/completions", response_model=CompletionResponse) |
Are you using completions in your benchmarking? If not, I think it is better not to support this legacy interface (https://platform.openai.com/docs/api-reference/completions/create).
Looks good, thank you Tim.
Please have a look at my comment about the completions interface.
I also made some changes to fix run_server.py code coverage in the review PR.
Summary
This PR adds a Mock LLM and an example Guardrails Content-Safety configuration that uses it end-to-end with Guardrails. I have a follow-on PR that uses Locust to run performance benchmarks on Guardrails on a laptop, without any NVCF function calls, local GPUs, or modifications to the Guardrails code.
Description
This PR includes an OpenAI-compatible Mock LLM FastAPI app. It is intended to mock production LLMs for performance-testing purposes. The configuration comes from a .env file, such as the one below for the Content Safety mock.
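A sketch of what such a .env file could look like; the variable names are the ones described below, but the specific values are illustrative assumptions rather than the file shipped with this PR:

```
# Illustrative Content Safety mock configuration – values are assumptions, not the PR's actual .env
SAFE_TEXT='{"User Safety": "safe"}'
UNSAFE_TEXT='{"User Safety": "unsafe"}'
UNSAFE_PROBABILITY=0.1
LATENCY_MEAN_SECONDS=0.5
LATENCY_STD_SECONDS=0.1
LATENCY_MIN_SECONDS=0.2
LATENCY_MAX_SECONDS=2.0
```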
The Mock LLM first decides randomly whether it should return a safe response, using the `UNSAFE_PROBABILITY` probability. This determines whether `SAFE_TEXT` or `UNSAFE_TEXT` is returned when the model responds. The Mock LLM then samples a latency for the response from a normal distribution (parameterized by `LATENCY_MEAN_SECONDS` and `LATENCY_STD_SECONDS`), and clips it against `LATENCY_MIN_SECONDS` and `LATENCY_MAX_SECONDS` respectively. After waiting, it responds with the text.
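As a rough sketch of that behavior (a standalone illustration; the function names and the use of the standard-library `random` module are assumptions, not necessarily how the PR implements it):

```python
import random
from typing import Optional


def sample_latency_seconds(
    mean: float, std: float, min_s: float, max_s: float, seed: Optional[int] = None
) -> float:
    """Draw one latency from a normal distribution and clip it to [min_s, max_s]."""
    rng = random.Random(seed)
    return min(max_s, max(min_s, rng.gauss(mean, std)))


def choose_response_text(unsafe_probability: float, safe_text: str, unsafe_text: str) -> str:
    """Return unsafe_text with probability unsafe_probability, otherwise safe_text."""
    return unsafe_text if random.random() < unsafe_probability else safe_text
```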
Test Plan
This test plan shows how the Mock LLM can be integrated seamlessly with Guardrails. As long as we characterize our Nemoguard and Application LLM latencies correctly and can represent them with a distribution, we can use this for performance testing (a hypothetical command sketch follows the terminal list below).
Terminal 1 (Content Safety Mock)
Terminal 2 (Content Safety Mock)
Terminal 3 (Guardrails production code)
Terminal 4 (Client issuing request)
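The exact commands are not reproduced here. A hypothetical sequence, assuming the mock server is started from its run_server.py entry point and Guardrails from the standard `nemoguardrails server` CLI (the ports, paths, and flags below are assumptions):

```sh
# Terminals 1 and 2 – start the mock LLM servers (entry point, flags, and ports are assumed)
python run_server.py --env-file content_safety.env --port 8001

# Terminal 3 – start the Guardrails server against the example config (path is assumed)
nemoguardrails server --config ./config

# Terminal 4 – issue a request through Guardrails
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"config_id": "content_safety", "messages": [{"role": "user", "content": "Hello!"}]}'
```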
Related Issue(s)
Checklist